Abstract Structured output prediction aims to learn a predictor that maps an
input data vector to a structured output, such as a vector, a tree, or a
sequence. We usually assume that a training set of input-output pairs is
available to train the predictor. However, in many real-world applications it
is difficult to obtain the output for an input, so the structured outputs are
missing for many training input data points. In this paper, we discuss how to
learn from a training set composed of some input-output pairs and some input
data points without outputs. This problem is called semi-supervised
structured output prediction. We propose a novel method for this problem that
constructs a nearest neighbor graph in the input space to represent the
manifold structure, and uses it to regularize the structured output space
directly. We define a slack structured output for each training data point,
and propose to predict it by learning a structured output predictor. The
learning of the slack structured outputs and of the predictor is unified in a
single minimization problem. In this problem, we minimize both the structured
loss between the slack structured outputs of neighboring data points and the
prediction error measured by the structured loss. The problem is optimized by
an iterative algorithm. Experimental results on three benchmark data sets
show its advantage.

Keywords Structured output prediction • Structured loss • Manifold
regularization • Neighborhood smoothness • Gradient descent


1 Introduction 

1.1 Background 

In the machine learning community, the problems of pattern classification and
regression have been studied well. Classification and regression are the two
most popular supervised learning problems. In these problems, we usually have
a training set of input-output pairs. The task is to train a predictive model
from the training set to predict the output of a test input. In both
classification and regression, the input is usually a feature vector. The
output of a classification problem is a binary class label, which represents
a positive class or a negative class. The output of a regression problem is a
continuous response variable. Recently, it has been proposed that the output
of a machine learning problem can go beyond a binary label or a continuous
response, and that the output is structured in many real-world applications.
For example, in multi-class classification problems, the output is a vector
indicating which class the input belongs to. In hierarchical classification
problems, the classes are organized as a tree, and each class is a node of
the tree. Moreover, in natural language parsing problems, the output for an
input language sequence is itself a sequence. When structured outputs are
considered, traditional predictive model learning algorithms cannot be used
because such outputs do not fit them. To solve this problem, structured
output prediction has been proposed to learn a predictor for a specific,
given type of structured output. This problem assumes that a training set of
input-structured output pairs is available. However, in real-world
applications, it is usually expensive or time-consuming to obtain a
structured output for an input data point. Thus in many cases, we have a
limited number of input-structured output pairs and a large number of inputs
without corresponding structured outputs. In this case, we try to learn a
predictive model from a large number of input data points and a small number
of structured outputs. This problem is called semi-supervised structured
output prediction [5, 21, 15]. In this paper, we investigate this problem and
propose a novel method to solve it.


1.2 Related works 

There are some existing works on the semi-supervised structured output
prediction problem. We introduce them as follows.

- Altun et al. [2] proposed the problem of predicting multiple inter-dependent
outputs in a semi-supervised setting, together with a method to solve it.
The method is a maximum-margin method, and it uses the manifold of the input
data space by exploring both the labeled and unlabeled data points. Moreover,
this method is inductive: it learns a predictive model that can predict the
structured outputs of new test data points.

- Brefeld and Scheffer [5] proposed a method for semi-supervised learning
for structured output prediction. The method is a co-training method, and it
is based on learning in a joint input-output space. It maximizes the
consensus among different independent hypotheses, and extends this idea to a
semi-supervised support vector machine learning algorithm in the joint
input-output space. Moreover, the prediction loss of a structured output is
measured by an arbitrary structured loss function.

- Suzuki et al. [21] proposed a semi-supervised structured output prediction
method for the sequence labeling task. This method is based on a combination
of generative and discriminative models. The objective of this method has a
log-linear form, and it combines a discriminative structured predictor with a
generative model that uses the input data points without structured outputs
(unlabeled data points). Moreover, these unlabeled data points are used by
the generative model to increase the sum of the discriminant functions for
all outputs.

- Li and Zemel [15] proposed a max-margin method for the semi-supervised
structured output prediction problem. This method can use both discrete
optimization algorithms and high order regularization based on the unlabeled
data points. This method is shown to be closely related to Posterior
Regularization.

Manifold learning is a popular topic in semi-supervised learning problems. It
imposes that if two data points are neighbors in the input space, their
outputs should also be close to each other. Because most of the outputs of
the training data points are missing, it is important to infer the missing
outputs from the available outputs by using the neighborhood relationship in
the input space. Manifold learning has been a powerful regularization method
in both classification and regression problems, where a squared $\ell_2$ norm
distance is usually used to measure how close two outputs (binary labels or
continuous responses) are. However, in the structured output prediction
problem, the squared $\ell_2$ norm distance does not fit the structured
outputs. In [2], a manifold regularization is also used. However, due to the
complexity of the structured outputs, the regularization is not performed
directly in the output space, but on the “parts” of the joint input-output
space. A pair of “parts” is also compared using the squared $\ell_2$ norm, so
that the regularization term does not make the optimization problem harder.
It is not guaranteed that regularizing the “parts” of the input-output space
leads to neighborhood smoothness in the output space. In principle, we can
measure how close a pair of structured outputs are with a predefined
structured loss function. However, due to the complexity of this loss
function, it is very difficult to optimize it with respect to the parameters
of the predictor.


1.3 Our contributions 

To solve the problem mentioned above, in this paper, we propose to regularize
the structured outputs directly in the structured output space. To avoid the
difficulty of optimizing the structured loss function, we introduce a slack
structured output for each training data point. This slack structured output
represents the optimal output, and it is treated as a variable during the
learning procedure. For the labeled data points, whose true structured
outputs are available, we impose that their slack structured outputs are
consistent with their true structured outputs. To propagate the structured
outputs from the labeled data to the unlabeled data, we use the manifold
information to represent the connections between the data points. To
represent the manifold information, we construct a nearest neighbor graph in
the input data space, and use it to regularize the output space directly.
More specifically, if the inputs of two data points are neighbors, we also
want their slack structured outputs to be close to each other. We use the
structured loss function to measure how close the compared structured outputs
are. Moreover, to learn the predictive model, we learn the model parameter so
that the model fits the slack structured outputs. In this way, we impose that
the slack structured output of a data point is consistent with both the
prediction result of the predictive model and the slack structured outputs of
its nearest neighbors.

The predictive model is designed as a linear function of a joint input-output
representation. We construct an objective function with respect to both the
slack structured outputs and the predictive model parameter. In this
objective function, we simultaneously minimize the losses of the prediction
results of the predictive model against the slack structured outputs, and the
losses between the slack structured outputs of each pair of neighboring data
points. The objective is optimized by an iterative algorithm, in which the
slack structured outputs and the predictive model parameter are updated
alternately.

The contributions of this work are twofold:

1. We solve the problem of manifold regularization in the structured output
space by introducing a slack structured output for each data point, both
labeled and unlabeled, and comparing the structured outputs of each pair of
neighboring data points by the structured loss function.

2. We propose a novel iterative algorithm to solve for the slack structured
outputs and the predictive model parameter alternately. The optimization of
the slack structured outputs is regularized by both the predictive model and
the manifold. Moreover, we develop an efficient gradient descent-based method
to update the predictive model parameter. This method is more efficient than
the cutting plane algorithm, the most popular optimization algorithm used in
structured output prediction methods, because it avoids the time-consuming
quadratic programming problem of the cutting plane algorithm.


1.4 Paper organization 

The rest of this paper is organized as follows. In Section 2, we introduce
the proposed method, by first modeling the problem as a minimization problem,
then solving it using an alternating optimization strategy, and finally
developing an iterative algorithm. In Section 3, the proposed method is
studied experimentally. It is compared to state-of-the-art semi-supervised
structured output prediction methods. Its sensitivity to the parameters and
its convergence are also studied. In Section 4, we give the conclusion and
the future works.


2 Proposed method 

In this section, we introduce the proposed method. The problem is modeled as
a minimization problem, and it is then solved by an alternating optimization
method with an iterative algorithm.


2.1 Problem modeling 

We consider the structured output prediction problem, where the input is a
$d$-dimensional vector, $\mathbf{x} \in \mathbb{R}^d$, and the output is a
structured output, $y \in \mathcal{Y}$, where $\mathcal{Y}$ is the structured
output space. We assume that we have a training set of data points
$\mathcal{X} = \{\theta_i\}_{i=1}^n$, where $\theta_i$ is the $i$-th data
point and $n$ is the number of data points in $\mathcal{X}$. $\mathcal{X}$ is
composed of two subsets, $\mathcal{X} = \mathcal{L} \cup \mathcal{U}$, where
$\mathcal{L}$ is the set of labeled data points and $\mathcal{U}$ is the set
of unlabeled data points. Each data point in $\mathcal{L}$ is an input-output
pair, $\theta_i = (\mathbf{x}_i, y_i)$, where $\mathbf{x}_i \in \mathbb{R}^d$
is the input vector of the $i$-th data point and $y_i \in \mathcal{Y}$ is its
corresponding structured output. The data points in $\mathcal{U}$ have only
inputs, while their structured outputs are missing, $\theta_i =
\mathbf{x}_i$. To learn the missing structured outputs for the data points in
$\mathcal{U}$, and a predictive model that predicts the structured output of
a test input, we consider the following problems when modeling the objective
function.

2.1.1 Regularizing the structured outputs by manifold 

We want to regularize the structured outputs by the manifold, but for the
data points in $\mathcal{U}$ the structured outputs are missing. To solve
this problem, we introduce a slack structured output, $z_i \in \mathcal{Y}$,
for each data point $\theta_i \in \mathcal{X}$. This slack structured output
$z_i$ represents the optimal output we want to learn for the $i$-th data
point.

For a labeled data point, $\theta_i \in \mathcal{L}$, since its true
structured output $y_i$ is known, we impose $z_i = y_i$. For the unlabeled
data points, we want to predict their slack outputs by propagating the output
information from the labeled data points via a manifold. To represent the
manifold information, we construct a nearest neighbor graph from the inputs
of the data points in $\mathcal{X}$. For the input vector $\mathbf{x}_i$ of
each data point, we find its $K$ nearest neighbors among the inputs of the
data points in $\mathcal{X}$, and denote the set of these neighbors as
$\mathcal{N}_i$. To construct the graph, we treat each data point as a node
of the graph, and put an edge between the $i$-th node and the $j$-th node if
$\mathbf{x}_j \in \mathcal{N}_i$. Denoting $\mathcal{E}$ as the set of edges,
we have

$$\mathcal{E} = \{(\theta_i, \theta_j) : \theta_i, \theta_j \in \mathcal{X},\ \mathbf{x}_j \in \mathcal{N}_i\}. \tag{1}$$

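The weight of the edge $(\theta_i, \theta_j)$, $\omega_{ij}$, is assigned as
a Gaussian kernel of the Euclidean distance between $\mathbf{x}_i$ and
$\mathbf{x}_j$,

$$\omega_{ij} = \begin{cases} \text{Gaussian kernel of } \|\mathbf{x}_i - \mathbf{x}_j\|_2, & \text{if } (\theta_i, \theta_j) \in \mathcal{E}, \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$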


$\omega_{ij}$ measures the similarity between a pair of neighboring data
points in the input space. We try to map this similarity relationship from
the input space to the structured output space. For a pair of neighboring
data points, if they are similar in the input space, i.e., $\omega_{ij}$ is
large, their structured outputs should also be similar, i.e., $z_i$ and $z_j$
should be close to each other. To measure how close $z_i$ and $z_j$ are, we
use a structured loss function, $\Delta(z_i, z_j)$, to compare $z_i$ against
$z_j$. $\Delta(z_i, z_j)$ measures the loss incurred if a structured label
$z_j$ is wrongly predicted as $z_i$. For example, when the structured outputs
are the nodes of a tree, $\Delta(z_i, z_j)$ is defined as the height of the
first common ancestor of $z_i$ and $z_j$ in the tree. Naturally, if
$\omega_{ij}$ is large, we want $\Delta(z_i, z_j)$ to be as small as
possible. Thus we propose to minimize $\Delta(z_i, z_j)$ weighted by
$\omega_{ij}$ with regard to $z_i$ and $z_j$,

$$\min_{z_1, \dots, z_n} \left\{ M(z_1, \dots, z_n) = \sum_{i,j:(\theta_i,\theta_j)\in\mathcal{E}} \omega_{ij} \Delta(z_i, z_j) \right\}, \quad \text{s.t. } z_i = y_i,\ \forall\, i : \theta_i \in \mathcal{L}. \tag{3}$$

In this way, we regularize the learning of the slack structured outputs
directly by the manifold, instead of regularizing the joint input-output
space.
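
The graph construction and the manifold term described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the bandwidth parameter `sigma`, the 0-1 loss, and the toy inputs are all assumptions made for the example, since the text does not fix the kernel width.

```python
import math


def knn_graph(inputs, k, sigma=1.0):
    """Return {(i, j): weight} for every j among the k nearest neighbors of i.

    sigma is an assumed Gaussian bandwidth; the text does not fix its value.
    """
    weights = {}
    for i, x_i in enumerate(inputs):
        # Squared Euclidean distance from x_i to every other input.
        dists = [
            (sum((a - b) ** 2 for a, b in zip(x_i, x_j)), j)
            for j, x_j in enumerate(inputs)
            if j != i
        ]
        for sq_dist, j in sorted(dists)[:k]:
            weights[(i, j)] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return weights


def manifold_term(weights, z, loss):
    """M(z_1, ..., z_n): weighted structured loss summed over all edges."""
    return sum(w * loss(z[i], z[j]) for (i, j), w in weights.items())


def zero_one(a, b):
    """A simple stand-in for the structured loss."""
    return 0.0 if a == b else 1.0


# Two tight pairs of points that lie far apart from each other.
inputs = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
graph = knn_graph(inputs, k=1)
smooth = manifold_term(graph, ["A", "A", "B", "B"], zero_one)
rough = manifold_term(graph, ["A", "B", "A", "B"], zero_one)
print(smooth, rough)
```

An assignment of slack outputs that agrees along the graph edges gives a smaller manifold term than one that disagrees with its neighbors, which is the behavior the regularizer is meant to reward.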



2.1.2 Learning predictive model 

The problem of structured output prediction is to learn a predictive model
$f$ that predicts a true structured output $y \in \mathcal{Y}$ from an input
$\mathbf{x} \in \mathbb{R}^d$,

$$y^* \leftarrow f(\mathbf{x}; \mathbf{w}), \tag{4}$$

where $\mathbf{w}$ is the parameter of the predictive model $f$. To design
the predictive model, we introduce a joint representation function that
matches an input $\mathbf{x}$ against a candidate structured output $y' \in
\mathcal{Y}$, $\Phi(\mathbf{x}, y') \in \mathbb{R}^m$, where $m$ is the
dimension of the joint representation. An example of this representation
function is the one for vector outputs, where $\mathbf{y}'$ is a vector and
$\Phi(\mathbf{x}, \mathbf{y}') = \mathbf{x} \odot \mathbf{y}'$ is the
Hadamard product of $\mathbf{x}$ and $\mathbf{y}'$. We further design a
matching function, $g(\mathbf{x}, y'; \mathbf{w})$, to obtain the matching
score of $\mathbf{x}$ and $y'$,

$$g(\mathbf{x}, y'; \mathbf{w}) = \mathbf{w}^\top \Phi(\mathbf{x}, y'), \tag{5}$$

where $\mathbf{w} \in \mathbb{R}^m$ is the parameter of the matching
function. The predictive model is based on the matching function, and it
returns the optimal candidate structured output, $y^*$, that maximizes the
matching score,

$$y^* \leftarrow f(\mathbf{x}; \mathbf{w}) = \arg\max_{y' \in \mathcal{Y}} \mathbf{w}^\top \Phi(\mathbf{x}, y'). \tag{6}$$

The prediction error can be measured by a loss function, $\Delta(y^*, y)$,
which compares the predicted structured output, $y^*$, against the true
structured output, $y$. The problem of structured output prediction is thus
reduced to learning the parameter vector $\mathbf{w}$.
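
To make the matching function and the arg max predictor concrete, the sketch below enumerates a tiny output space. It assumes, purely for illustration, that the outputs are one-hot class vectors and that the joint representation is the flattened outer product of the input and the output; the names `joint_feature`, `score`, and `predict` are hypothetical.

```python
def joint_feature(x, y):
    """Phi(x, y): flattened outer product of input x and output vector y."""
    return [x_a * y_b for y_b in y for x_a in x]


def score(w, x, y):
    """g(x, y; w) = w^T Phi(x, y)."""
    return sum(w_k * phi_k for w_k, phi_k in zip(w, joint_feature(x, y)))


def predict(w, x, candidates):
    """f(x; w): the candidate output with the highest matching score."""
    return max(candidates, key=lambda y: score(w, x, y))


# Two classes encoded as one-hot vectors (a toy output space).
candidates = [(1, 0), (0, 1)]
# The first two weights score class 0, the last two score class 1.
w = [1.0, 0.0, 0.0, 1.0]
first = predict(w, (2.0, 0.5), candidates)
second = predict(w, (0.5, 2.0), candidates)
print(first, second)
```

Enumerating all candidates is only feasible for small output spaces; for sequences or large trees the arg max would need a problem-specific inference routine.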

Since the true structured outputs of the data points in $\mathcal{U}$ are
missing, we use the slack structured outputs to guide the learning of the
model parameter. We hope that, with the learned parameter vector
$\mathbf{w}$, the loss of predicting $z_i$ as $y_i^*$ for the $i$-th training
data point, $\Delta(y_i^*, z_i)$, is minimized. Thus we have the following
optimization problem,

$$\min_{\mathbf{w}, z_1, \dots, z_n} \sum_{i=1}^n \Delta(y_i^*, z_i), \quad \text{s.t. } z_i = y_i,\ \forall\, i : \theta_i \in \mathcal{L}, \tag{7}$$

where $y_i^*$ is the predicted structured output of the $i$-th data point.

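Due to the complexity of the loss function $\Delta$, this problem is hard to
optimize with regard to $\mathbf{w}$ directly. Instead of minimizing
$\Delta(y_i^*, z_i)$ directly, we seek and minimize its upper bound.
According to (6),

$$\begin{aligned}
& \mathbf{w}^\top \Phi(\mathbf{x}_i, y_i^*) \geq \mathbf{w}^\top \Phi(\mathbf{x}_i, z_i),\ \forall\, z_i \in \mathcal{Y} \\
\Rightarrow\ & \mathbf{w}^\top (\Phi(\mathbf{x}_i, y_i^*) - \Phi(\mathbf{x}_i, z_i)) + \Delta(y_i^*, z_i) \geq \Delta(y_i^*, z_i).
\end{aligned} \tag{8}$$

We replace the predicted structured output $y_i^*$ in (8) by a structured
output $y_i''$ that maximizes the left-hand side of the last line of (8), so
that

$$\begin{aligned}
& \max_{y_i'' \in \mathcal{Y}} \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, y_i'') - \Phi(\mathbf{x}_i, z_i)) + \Delta(y_i'', z_i) \right] \\
& \geq \mathbf{w}^\top (\Phi(\mathbf{x}_i, y_i^*) - \Phi(\mathbf{x}_i, z_i)) + \Delta(y_i^*, z_i) \\
& \geq \Delta(y_i^*, z_i).
\end{aligned} \tag{9}$$

Thus an upper bound of $\Delta(y_i^*, z_i)$ is obtained as follows,

$$\begin{aligned}
& \max_{y_i'' \in \mathcal{Y}} \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, y_i'') - \Phi(\mathbf{x}_i, z_i)) + \Delta(y_i'', z_i) \right] \\
& = \mathbf{w}^\top (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + \Delta(v_i, z_i),
\end{aligned} \tag{10}$$

where $v_i$ is the structured output that maximizes the left-hand side of
(10),

$$v_i = \arg\max_{y_i'' \in \mathcal{Y}} \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, y_i'') - \Phi(\mathbf{x}_i, z_i)) + \Delta(y_i'', z_i) \right]. \tag{11}$$

Replacing $\Delta(y_i^*, z_i)$ by its upper bound in (10), we rewrite (7) as

$$\min_{\mathbf{w}, z_1, \dots, z_n} \left\{ L(\mathbf{w}, z_1, \dots, z_n) = \sum_{i=1}^n \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + \Delta(v_i, z_i) \right] \right\}, \quad \text{s.t. } z_i = y_i,\ \forall\, i : \theta_i \in \mathcal{L}. \tag{12}$$

In this way, we transfer the problem of minimizing $\Delta(y_i^*, z_i)$ to
the minimization of its upper bound.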
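
The upper bound derived above can be checked numerically on a small example. The sketch below is an illustration under assumptions, not the authors' code: it uses one-hot outputs, an outer-product joint representation, a 0-1 loss, and hypothetical helper names, and it enumerates the output space to find the loss-augmented output of Eq. (11).

```python
def joint_feature(x, y):
    """Phi(x, y): flattened outer product of x with a one-hot vector y."""
    return [x_a * y_b for y_b in y for x_a in x]


def score(w, x, y):
    """Matching score w^T Phi(x, y)."""
    return sum(w_k * p_k for w_k, p_k in zip(w, joint_feature(x, y)))


def zero_one(a, b):
    """0-1 structured loss, used here only as a simple stand-in."""
    return 0.0 if a == b else 1.0


def loss_augmented_output(w, x, z, candidates, loss):
    """v: the candidate maximizing score difference plus loss against z."""
    return max(candidates,
               key=lambda y: score(w, x, y) - score(w, x, z) + loss(y, z))


def upper_bound(w, x, z, candidates, loss):
    """Value of the bound in Eq. (10) for one data point."""
    v = loss_augmented_output(w, x, z, candidates, loss)
    return score(w, x, v) - score(w, x, z) + loss(v, z)


candidates = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
x = (1.0, -1.0)
w = [0.3, 0.1, -0.2, 0.4, 0.0, 0.5]
y_star = max(candidates, key=lambda y: score(w, x, y))
for z in candidates:
    # The bound never falls below the loss of the plain prediction y_star.
    print(z, zero_one(y_star, z), upper_bound(w, x, z, candidates, zero_one))
```

The bound is tight only when the loss-augmented output coincides with the slack output, so minimizing it pushes the score of the slack output above the scores of competing outputs by a margin that depends on the loss.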

2.1.3 Reducing the model complexity 

To avoid over-fitting, we try to reduce the complexity of the model. The
complexity of the model can be measured by the squared $\ell_2$ norm of the
model parameter vector, $\|\mathbf{w}\|_2^2$. To reduce the complexity, we
propose to minimize a regularization term $R(\mathbf{w})$,

$$\min_{\mathbf{w}} \left\{ R(\mathbf{w}) = \frac{1}{2} \|\mathbf{w}\|_2^2 \right\}. \tag{13}$$


2.1.4 Overall optimization problem 

The overall optimization problem of the proposed method is a combination of
the three terms in (3), (12), and (13),

$$\begin{aligned}
\min_{\mathbf{w}, z_1, \dots, z_n} \Big\{ O(\mathbf{w}, z_1, \dots, z_n)
&= M(z_1, \dots, z_n) + C_1 L(\mathbf{w}, z_1, \dots, z_n) + C_2 R(\mathbf{w}) \\
&= \sum_{i,j:(\theta_i,\theta_j)\in\mathcal{E}} \omega_{ij} \Delta(z_i, z_j) \\
&\quad + C_1 \sum_{i=1}^n \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + \Delta(v_i, z_i) \right] + \frac{C_2}{2} \|\mathbf{w}\|_2^2 \Big\}, \\
\text{s.t. } z_i = y_i,\ & \forall\, i : \theta_i \in \mathcal{L},
\end{aligned} \tag{14}$$

where $C_1$ and $C_2$ are the tradeoff parameters. The first term of the
objective function regularizes the slack structured outputs by the manifold,
the second term reduces the prediction error, and the last term reduces the
complexity of the model. In this problem, the learning of the slack
structured outputs is regularized by three information sources: the manifold,
the known true structured outputs of the labeled data points, and the
prediction results of the predictive model.
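
As a rough illustration of how the three terms combine, the sketch below evaluates the objective for given slack outputs, loss-augmented outputs, and weights. The function signature, the two-class block feature map, and the toy numbers are assumptions chosen so that the arithmetic can be followed by hand.

```python
def objective(w, xs, zs, vs, weights, feature, loss, c1, c2):
    """O = M + C1 * L + C2 * R for fixed loss-augmented outputs vs."""
    # Manifold term: weighted loss between slack outputs of neighbors.
    manifold = sum(wt * loss(zs[i], zs[j]) for (i, j), wt in weights.items())
    # Prediction term: upper bound of the loss against the slack outputs.
    fit = 0.0
    for x, z, v in zip(xs, zs, vs):
        diff = [a - b for a, b in zip(feature(x, v), feature(x, z))]
        fit += sum(w_k * d_k for w_k, d_k in zip(w, diff)) + loss(v, z)
    # Complexity term: half the squared l2 norm of w.
    reg = 0.5 * sum(w_k ** 2 for w_k in w)
    return manifold + c1 * fit + c2 * reg


def feature(x, y):
    """Toy Phi for two classes: x is copied into the block of class y."""
    return [x[0] if y == 0 else 0.0, x[0] if y == 1 else 0.0]


def zero_one(a, b):
    return 0.0 if a == b else 1.0


value = objective(
    w=[-1.0, 1.0],
    xs=[(1.0,), (2.0,)],
    zs=[0, 1],
    vs=[1, 1],
    weights={(0, 1): 0.5},
    feature=feature, loss=zero_one, c1=2.0, c2=4.0,
)
print(value)
```

In this toy case the manifold term is 0.5, the prediction term is 3, and the regularizer is 1, so the objective is 0.5 + 2 * 3 + 4 * 1.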

2.2 Problem optimization

To solve the problem in (14), we use an alternating optimization strategy. In
an iterative algorithm, when the model parameter vector $\mathbf{w}$ is
considered, the slack structured outputs $z_1, \dots, z_n$ are fixed. When
$z_1, \dots, z_n$ are considered, $\mathbf{w}$ is fixed. In the following
subsections, we discuss how to solve for $\mathbf{w}$ and $z_1, \dots, z_n$
respectively.


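2.2.1 Solving $\mathbf{w}$ while fixing $z_1, \dots, z_n$

When we consider the model parameter vector $\mathbf{w}$, the slack
structured outputs $z_1, \dots, z_n$ are fixed. We remove the terms in (14)
irrelevant to $\mathbf{w}$, and obtain the following problem,

$$\min_{\mathbf{w}} \left\{ O_1(\mathbf{w}) = C_1 \sum_{i=1}^n \left[ \mathbf{w}^\top (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + \Delta(v_i, z_i) \right] + \frac{C_2}{2} \|\mathbf{w}\|_2^2 \right\}. \tag{15}$$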

Please note that $v_i$ is also a function of $\mathbf{w}$ according to (11).
Because $v_i$ is coupled with a maximization problem, it is hard to optimize
with regard to $\mathbf{w}$ directly. We therefore follow the strategy of the
expectation-maximization algorithm: we update $v_i$ using the solutions of
$\mathbf{w}$ and $z_1, \dots, z_n$ from the previous iteration, and then fix
it while $\mathbf{w}$ is optimized in the current iteration. After $v_i$ is
fixed, we use the gradient descent algorithm to update $\mathbf{w}$. To
minimize $O_1(\mathbf{w})$, $\mathbf{w}$ should move in the direction
opposite to the gradient. The gradient of $O_1(\mathbf{w})$ is

$$\nabla O_1(\mathbf{w}) = C_1 \sum_{i=1}^n (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + C_2 \mathbf{w}. \tag{16}$$


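The updating rule is

$$\begin{aligned}
\mathbf{w} &\leftarrow \mathbf{w} - \eta \nabla O_1(\mathbf{w}) \\
&= \mathbf{w} - \eta \left[ C_1 \sum_{i=1}^n (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)) + C_2 \mathbf{w} \right] \\
&= (1 - \eta C_2) \mathbf{w} - \eta C_1 \sum_{i=1}^n (\Phi(\mathbf{x}_i, v_i) - \Phi(\mathbf{x}_i, z_i)),
\end{aligned} \tag{17}$$

where $\eta$ is the descent step size.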
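
The update in Eq. (17) is simple enough to sketch directly. The code below is an illustrative assumption rather than the authors' implementation; `gradient_step`, the two-class block feature map, and the toy values are hypothetical.

```python
def gradient_step(w, xs, zs, vs, feature, c1, c2, eta):
    """One update w <- (1 - eta*C2) w - eta*C1 * sum_i (Phi(x_i, v_i) - Phi(x_i, z_i))."""
    total = [0.0] * len(w)
    for x, z, v in zip(xs, zs, vs):
        for k, (a, b) in enumerate(zip(feature(x, v), feature(x, z))):
            total[k] += a - b
    return [(1.0 - eta * c2) * w_k - eta * c1 * t_k
            for w_k, t_k in zip(w, total)]


def feature(x, y):
    """Toy Phi for two classes: x is copied into the block of class y."""
    return [x[0] if y == 0 else 0.0, x[0] if y == 1 else 0.0]


# One data point whose slack output is class 0 and whose
# loss-augmented output is class 1.
w_new = gradient_step([0.0, 0.0], xs=[(1.0,)], zs=[0], vs=[1],
                      feature=feature, c1=1.0, c2=1.0, eta=0.1)
print(w_new)
```

The step raises the weight attached to the slack output and lowers the weight attached to the competing loss-augmented output, so no quadratic program has to be solved at this stage.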

2.2.2 Solving $z_1, \dots, z_n$ while fixing $\mathbf{w}$

We fix $\mathbf{w}$ when $z_1, \dots, z_n$ are considered, and remove the
terms irrelevant to $z_1, \dots, z_n$. The following problem is obtained,

$$\min_{z_1, \dots, z_n} \left\{ O_2(z_1, \dots, z_n) = \sum_{i,j:(\theta_i,\theta_j)\in\mathcal{E}} \omega_{ij} \Delta(z_i, z_j) + C_1 \sum_{i=1}^n \left[ -\mathbf{w}^\top \Phi(\mathbf{x}_i, z_i) + \Delta(v_i, z_i) \right] \right\}, \quad \text{s.t. } z_i = y_i,\ \forall\, i : \theta_i \in \mathcal{L}. \tag{18}$$

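It is difficult to optimize all the slack structured outputs $z_1, \dots,
z_n$ simultaneously. Thus we choose to update them one by one. When one slack
structured output $z_i$ is considered, the other ones, $z_j|_{j \neq i}$, are
fixed. In this case, we obtain the following problem for the $i$-th data
point,

$$\min_{z_i} \left\{ O_3(z_i) = \sum_{j:(\theta_i,\theta_j)\in\mathcal{E}} \omega_{ij} \Delta(z_i, z_j) + \sum_{j':(\theta_{j'},\theta_i)\in\mathcal{E}} \omega_{j'i} \Delta(z_{j'}, z_i) + C_1 \left[ -\mathbf{w}^\top \Phi(\mathbf{x}_i, z_i) + \Delta(v_i, z_i) \right] \right\}, \quad \text{s.t. } z_i = y_i,\ \forall\, i : \theta_i \in \mathcal{L}. \tag{19}$$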


From this formulation, we can see that the optimal $z_i$ should be consistent
with the slack structured outputs of its nearest neighbors and with the
prediction result of the predictive model. Since (19) is a minimization
problem, its solution can be obtained by a linear search in the structured
output space,

$$z_i = \begin{cases} \arg\min_{y' \in \mathcal{Y}} O_3(y'), & \text{if } \theta_i \in \mathcal{U}, \\ y_i, & \text{otherwise.} \end{cases} \tag{20}$$
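
The per-point update can be sketched as a plain enumeration of the candidate outputs. This is a minimal sketch under assumptions: `update_slack_output` and its arguments are hypothetical names, `labels[i]` is `None` for an unlabeled point, and the output space is small enough to enumerate.

```python
def update_slack_output(i, labels, xs, zs, v_i, w, weights, candidates,
                        feature, loss, c1):
    """Return the new slack output of point i by linear search."""
    if labels[i] is not None:
        # Labeled points keep their true structured output.
        return labels[i]

    def local_objective(y):
        # Edges leaving point i and edges entering point i.
        out_edges = sum(wt * loss(y, zs[j])
                        for (a, j), wt in weights.items() if a == i)
        in_edges = sum(wt * loss(zs[a], y)
                       for (a, j), wt in weights.items() if j == i)
        match = sum(w_k * p_k for w_k, p_k in zip(w, feature(xs[i], y)))
        return out_edges + in_edges + c1 * (-match + loss(v_i, y))

    return min(candidates, key=local_objective)


def feature(x, y):
    """Toy Phi for two classes: x is copied into the block of class y."""
    return [x[0] if y == 0 else 0.0, x[0] if y == 1 else 0.0]


def zero_one(a, b):
    return 0.0 if a == b else 1.0


labels = [0, None]                   # point 0 is labeled, point 1 is not
zs = [0, 1]                          # current slack outputs
weights = {(0, 1): 1.0, (1, 0): 1.0}
new_z1 = update_slack_output(1, labels, [(1.0,), (1.1,)], zs, v_i=0,
                             w=[0.0, 0.0], weights=weights,
                             candidates=[0, 1], feature=feature,
                             loss=zero_one, c1=0.5)
print(new_z1)
```

With an uninformative predictor, the unlabeled point adopts the output of its labeled neighbor, which is how labels spread along the graph.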


2.3 Iterative algorithm 

We summarize the developed iterative learning algorithm in Algorithm 1. The
iterations are repeated $T$ times. In each iteration, we first update $v_i$
and $z_i$ for each data point, and then update $\mathbf{w}$. We name this
algorithm the manifold regularized structured output learning algorithm
(MRSO).

Algorithm 1 Iterative algorithm of MRSO.

Input: Training set of data points $\mathcal{X}$;
Input: Tradeoff parameters $C_1$ and $C_2$;
Input: Maximum number of iterations, $T$;
Initialize the model parameter vector $\mathbf{w}^0$;
Initialize the slack structured outputs $z_1^0, \dots, z_n^0$;
for $t = 1, \dots, T$ do
  for $i = 1, \dots, n$ do
    Update $v_i^t$ of the $i$-th data point by fixing $z_i^{t-1}$ and
    $\mathbf{w}^{t-1}$,

    $$v_i^t = \arg\max_{y'' \in \mathcal{Y}} \left[ (\mathbf{w}^{t-1})^\top (\Phi(\mathbf{x}_i, y'') - \Phi(\mathbf{x}_i, z_i^{t-1})) + \Delta(y'', z_i^{t-1}) \right]; \tag{21}$$

    Update $z_i^t$ of the $i$-th data point by fixing $\mathbf{w}^{t-1}$,
    $z_j^{t-1}|_{j \neq i}$ and $v_i^t$.
    if $\theta_i \in \mathcal{U}$ then

      $$z_i^t = \arg\min_{y' \in \mathcal{Y}} \left\{ \sum_{j:(\theta_i,\theta_j)\in\mathcal{E}} \omega_{ij} \Delta(y', z_j^{t-1}) + \sum_{j':(\theta_{j'},\theta_i)\in\mathcal{E}} \omega_{j'i} \Delta(z_{j'}^{t-1}, y') + C_1 \left[ -(\mathbf{w}^{t-1})^\top \Phi(\mathbf{x}_i, y') + \Delta(v_i^t, y') \right] \right\}; \tag{22}$$

    else
      $z_i^t = y_i$;
    end if
  end for
  Update $\mathbf{w}^t$ by fixing $v_1^t, \dots, v_n^t$ and $z_1^t, \dots,
  z_n^t$,

  $$\mathbf{w}^t = (1 - \eta C_2) \mathbf{w}^{t-1} - \eta C_1 \sum_{i=1}^n (\Phi(\mathbf{x}_i, v_i^t) - \Phi(\mathbf{x}_i, z_i^t)); \tag{23}$$

end for
Output: $\mathbf{w}^T$ and $z_1^T, \dots, z_n^T$.
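
A compact sketch of the whole loop for a multi-class toy problem is given below. It is not the authors' implementation. It assumes class indices as outputs, a block (outer-product) joint representation, a 0-1 loss, a Gaussian bandwidth `sigma`, a fixed step size `eta`, and an initialization of the unlabeled slack outputs to class 0; none of these choices are specified by Algorithm 1.

```python
import math


def feature(x, y, n_classes):
    """Phi(x, y): x copied into the block of class y."""
    phi = [0.0] * (len(x) * n_classes)
    phi[y * len(x):(y + 1) * len(x)] = x
    return phi


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def zero_one(a, b):
    return 0.0 if a == b else 1.0


def knn_weights(xs, k, sigma):
    """Gaussian weights on the directed k-nearest-neighbor edges."""
    weights = {}
    for i, xi in enumerate(xs):
        d = sorted((sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
                   for j, xj in enumerate(xs) if j != i)
        for sq, j in d[:k]:
            weights[(i, j)] = math.exp(-sq / (2 * sigma ** 2))
    return weights


def mrso(xs, labels, n_classes, c1=1.0, c2=0.1, eta=0.05, iters=20,
         k=2, sigma=1.0):
    """labels[i] is a class index, or None when the output is missing."""
    cands = range(n_classes)
    weights = knn_weights(xs, k, sigma)
    w = [0.0] * (len(xs[0]) * n_classes)
    # Assumed initialization: unlabeled slack outputs start at class 0.
    z = [y if y is not None else 0 for y in labels]
    for _ in range(iters):
        z_old, v = list(z), [None] * len(xs)
        for i, x in enumerate(xs):
            # Eq. (21); the term -w^T Phi(x_i, z_i) is constant in y.
            v[i] = max(cands, key=lambda y: dot(w, feature(x, y, n_classes))
                       + zero_one(y, z_old[i]))
            if labels[i] is None:
                def local(y):
                    out_e = sum(wt * zero_one(y, z_old[j])
                                for (a, j), wt in weights.items() if a == i)
                    in_e = sum(wt * zero_one(z_old[a], y)
                               for (a, j), wt in weights.items() if j == i)
                    return out_e + in_e + c1 * (
                        -dot(w, feature(x, y, n_classes)) + zero_one(v[i], y))
                # Eq. (22): linear search over the output space.
                z[i] = min(cands, key=local)
        # Eq. (23): gradient step on w with v and z fixed.
        grad = [0.0] * len(w)
        for x, zi, vi in zip(xs, z, v):
            diff = zip(feature(x, vi, n_classes), feature(x, zi, n_classes))
            for kk, (a, b) in enumerate(diff):
                grad[kk] += a - b
        w = [(1 - eta * c2) * wk - eta * c1 * g for wk, g in zip(w, grad)]
    return w, z


xs = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (3.0, 3.0), (3.2, 3.0), (3.1, 3.1)]
labels = [0, None, None, 1, None, None]
w, z = mrso(xs, labels, n_classes=2)
print(z)
```

On this toy set with two well-separated clusters and one labeled point per cluster, the slack outputs of the unlabeled points follow the label of their own cluster. Real structured outputs would replace the enumeration over `cands` with a task-specific search.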


3 Experiments 

3.1 Data sets 

- The first data set we used is the Cora data set. The output of this data
set is the class label vector of a multi-class classification problem. This
data set is a data set of linked computer science papers. Each paper is
treated as a data point. In this data set, there are 9,947 data points. The
papers without a reference list are removed from the data set, and 9,555
papers are left. All the papers belong to 8 classes. To construct a feature
vector from a paper, we extract a term frequency vector and a link view
vector, and concatenate them as a feature vector [38].

For each data point, $\mathbf{x}_i$, we construct a vector output,
$\mathbf{y}_i = [y_{i1}, \dots, y_{i8}] \in \{1, 0\}^8$, as the structured
output. This vector, $\mathbf{y}_i$, is an 8-dimensional binary vector. If
this data point belongs to the $k$-th class, then the $k$-th element of this
vector is 1, and 0 otherwise,

$$y_{ik} = \begin{cases} 1, & \text{if } \mathbf{x}_i \text{ belongs to the } k\text{-th class}, \\ 0, & \text{otherwise.} \end{cases} \tag{24}$$

We further define the joint input-output representation function as
$\Phi(\mathbf{x}, \mathbf{y}) = \mathbf{x} \otimes \mathbf{y}$. We measure
the prediction error of predicting $\mathbf{y}_i$ as $\mathbf{y}_i^*$ by the
0-1 loss, and define $\Delta(\mathbf{y}_i^*, \mathbf{y}_i)$ as

$$\Delta(\mathbf{y}_i^*, \mathbf{y}_i) = \begin{cases} 1, & \text{if } \mathbf{y}_i^* \neq \mathbf{y}_i, \\ 0, & \text{otherwise.} \end{cases} \tag{25}$$

Fig. 1 Tree structured outputs of SUN data set.


- The second data set is the SUN data set. The outputs of this data set are
the nodes of a tree structure. In this data set, there are 2,000 images, and
they belong to 15 different classes of scenes. The classes are organized as a
scene tree. The root node is scene, and it has three child nodes, which are
indoor, outdoor land space, and outdoor man-made. These three child nodes
have a further 15 leaf nodes, which are the 15 classes. Thus there are 19
nodes in the tree in total. The scene tree is shown in Figure 1. Each image
belongs to one of the classes. To represent an image, we extract HOG features
from it and use them as visual features. In this case, the structured output
is a node of the tree. We represent the output of the $i$-th data point by a
19-dimensional binary vector $\mathbf{y}_i \in \{1, 0\}^{19}$. The $k$-th
element of $\mathbf{y}_i$ is defined as

$$y_{ik} = \begin{cases} 1, & \text{if the } k\text{-th node is the class of } \mathbf{x}_i \text{ or an ancestor of the class of } \mathbf{x}_i, \\ 0, & \text{otherwise.} \end{cases} \tag{26}$$

We also define the joint input-output representation function as
$\Phi(\mathbf{x}, \mathbf{y}) = \mathbf{x} \otimes \mathbf{y}$. The
structured loss function $\Delta(\mathbf{y}_i^*, \mathbf{y}_i)$ is defined as
the height of the first common ancestor of the predicted output
$\mathbf{y}_i^*$ and the true output $\mathbf{y}_i$.

- The third data set is a subset of the Biocreative data set, provided by the
special session of CoNLL2002. The outputs of this data set are label
sequences. This set contains 500 sentences from biomedical papers. Each word
in a sentence can be labeled as one of nine named entities. The problem is to
assign a sequence of named entity labels to a sentence. Thus the output of a
sentence of $m$ words, $\mathbf{x}_i$, is a sequence of labels, $y_i =
\{y_{i1}, \dots, y_{im}\}$, where $y_{ik}$ is the label of the $k$-th word.
The joint input-output representation function, $\Phi(\mathbf{x}_i, y_i)$, is
defined as the histogram of state transitions and a set of features
describing the emissions [22]. The structured loss function that compares a
predicted label sequence $y_i^*$ against the true label sequence $y_i$ is
defined as follows,



(27) 
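
The tree-structured encoding and loss used for the SUN data set can be illustrated with a reduced tree. In the sketch below the three intermediate node names come from the text, while the leaf names are hypothetical placeholders (the text does not list the 15 scene classes) and the helper functions are assumptions made for the example.

```python
# Child -> parent map of a reduced scene tree. The leaves are hypothetical.
PARENT = {
    "indoor": "scene",
    "outdoor land space": "scene",
    "outdoor man-made": "scene",
    "kitchen": "indoor",              # hypothetical leaf
    "bedroom": "indoor",              # hypothetical leaf
    "coast": "outdoor land space",    # hypothetical leaf
    "street": "outdoor man-made",     # hypothetical leaf
}
NODES = ["scene"] + sorted(PARENT)


def ancestors(node):
    """The node itself followed by its ancestors up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def height(node):
    """Length of the longest downward path from node to a leaf."""
    children = [c for c, p in PARENT.items() if p == node]
    return 0 if not children else 1 + max(height(c) for c in children)


def tree_loss(a, b):
    """Height of the first common ancestor of nodes a and b."""
    on_path = set(ancestors(b))
    first_common = next(n for n in ancestors(a) if n in on_path)
    return height(first_common)


def encode(node):
    """Binary vector marking the node and all of its ancestors."""
    marked = set(ancestors(node))
    return [1 if n in marked else 0 for n in NODES]


print(tree_loss("kitchen", "kitchen"),
      tree_loss("kitchen", "bedroom"),
      tree_loss("kitchen", "street"))
print(encode("kitchen"))
```

Under this loss, confusing two classes that share an intermediate node costs less than confusing classes whose only common ancestor is the root, and a correct prediction costs nothing.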


3.2 Experiment setup 

To perform the experiment, we employ 10-fold cross validation. An entire data
set is split into ten folds randomly. Each fold is used as a test set in
turn. The remaining nine folds are combined as a training set. Moreover, we
randomly select two folds from the training set as the labeled data set, and
leave the remaining seven folds as the unlabeled data set. The proposed
method is applied to the training set to learn the predictive model parameter
and the structured outputs of the unlabeled training data points. The learned
predictive model is then applied to the test set to predict the structured
outputs of the test data points. The prediction performance is evaluated by
the average structured loss (ASL) over the test set, $\mathcal{T}$,

$$\mathrm{ASL} = \frac{1}{|\mathcal{T}|} \sum_{i:\theta_i \in \mathcal{T}} \Delta(y_i^*, y_i). \tag{28}$$
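
The splitting protocol and the evaluation measure can be sketched as follows. The function names, the fixed random seed, and the toy predictions are assumptions for illustration only.

```python
import random


def split_folds(n_points, n_folds=10, seed=0):
    """Shuffle the point indices and deal them into n_folds folds."""
    idx = list(range(n_points))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


def protocol(folds, test_fold, n_labeled_folds=2, seed=0):
    """Return (labeled, unlabeled, test) index lists for one round."""
    train_folds = [f for f in range(len(folds)) if f != test_fold]
    labeled_folds = set(
        random.Random(seed).sample(train_folds, n_labeled_folds))
    labeled = [i for f in train_folds if f in labeled_folds
               for i in folds[f]]
    unlabeled = [i for f in train_folds if f not in labeled_folds
                 for i in folds[f]]
    return labeled, unlabeled, list(folds[test_fold])


def average_structured_loss(predicted, truth, loss):
    """Mean structured loss over the test set."""
    return sum(loss(p, t) for p, t in zip(predicted, truth)) / len(truth)


def zero_one(a, b):
    return 0.0 if a == b else 1.0


folds = split_folds(100)
labeled, unlabeled, test = protocol(folds, test_fold=0)
print(len(labeled), len(unlabeled), len(test))
asl = average_structured_loss([0, 1, 1, 0], [0, 1, 0, 0], zero_one)
print(asl)
```

Repeating `protocol` with each fold as the test fold gives the ten ASL values that a boxplot of the cross validation would summarize.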


3.3 Experiment results 

In this section, we study the proposed method experimentally. We first
compare it to state-of-the-art semi-supervised structured output prediction
methods. Then we study the convergence of the proposed iterative algorithm.
Finally, we study how the algorithm performs with different tradeoff
parameters.

3.3.1 Comparison to state-of-the-art 

We compare the proposed MRSO algorithm against several state-of-the-art
semi-supervised learning methods for structured output prediction. We list
them as follows:

- Semi-supervised structured (STR) max-margin optimization method [2],

- Co-support vector learning for structured output variables (CoSVM) [5],

- Semi-supervised structured output learning based on a hybrid generative
and discriminative models (HySOL) [21], and

- High order regularization for semi-supervised learning of structured output
problems (HOR) [15].

The boxplots of the 10-fold cross validation are given in Figure 2. From the
results in Figure 2, we can see that the proposed MRSO algorithm outperforms
the other algorithms on all three data sets. For example, in Figure 2(a), the
median ASL of MRSO is about 0.4, while the median ASL of the second best
method, HOR, is about 0.5. For the other three methods, the median values of
ASL are higher than 0.5, at around 0.55. The advantage of the proposed MRSO
algorithm over the compared methods is even more obvious in Figure 2(b). In
this figure, only the proposed MRSO method achieves a median ASL lower than
0.6, while the median values of the compared methods are higher than 0.7.
Moreover, it seems that HOR and HySOL perform better than CoSVM and STR.


3.3.2 Algorithm convergence

The proposed algorithm is an iterative algorithm. We also study the
convergence of the algorithm by plotting the value of the objective function
over the iterations. This experiment is conducted on the Cora data set. The
curve is given in Figure 3. From this figure, we can observe that the
iterative algorithm converges. The objective decreases significantly from the
first iteration to the 60-th iteration, and then stays stable after the 60-th
iteration. This indicates that the algorithm converges.


3.3.3 Tradeoff parameter analysis 


In the objective function of our formulation (14), there are two tradeoff
parameters, $C_1$ and $C_2$. We also want to know how these parameters affect
the performance of our algorithm. To this end, we plot the ASL obtained with
different values of $C_1$ and $C_2$. The curves are shown in Figure 4. Please
note that the data in Figure 4 are obtained by conducting experiments on the
Cora data set. From this figure, we can observe that our algorithm is stable
with respect to both parameters. In Figure 4(a), when the parameter $C_1$
varies from 0.1 to 1000, the ASL of MRSO stays within [0.40, 0.45], and the
variance is very small. Moreover, in Figure 4(b), we can also observe that
the ASL of MRSO stays within [0.40, 0.43] when $C_2$ is varied.

Fig. 2 Results of comparison to state-of-the-art. Panels: (a) Cora data set,
(b) SUN data set, (c) Biocreative data set; x-axis: Methods (MRSO, HOR,
HySOL, CoSVM).

Fig. 3 Responses of objective function of different iterations (x-axis:
iteration).


4 Conclusion and future works 

This paper investigates the problem of semi-supervised structured output
prediction. We propose to use the manifold structure to regularize the
structured outputs directly. However, in this problem, many training data
points only have input feature vectors, while their structured outputs are
missing. To solve this problem, we introduce a slack structured output for
each training data point, either labeled or unlabeled. Moreover, we construct
a nearest neighbor graph in the input space to represent the manifold
structure, and use it to regularize the learning of the slack structured
outputs. We impose that the slack structured outputs are consistent with both
the manifold structure and the prediction results of a structured output
predictor. More specifically, we use a structured loss function to measure
how well a pair of structured outputs fits the manifold distribution. A
unified objective is constructed for the learning of both the slack
structured outputs and the predictive model parameter, and an iterative
algorithm is proposed to minimize this objective function. The experimental
results show that the proposed algorithm outperforms the state-of-the-art
semi-supervised structured output prediction methods.